Abstract
Introduction
Data provenance has been formally studied throughout the database community since the early 2000s and is now an important component of both Open Science and Open Government research and of technological policy formulation. Its importance lies in the ability to track the lineage of a piece of data and to understand how it has been transformed over space and time. Data provenance (also known as, or associated with, data lineage) describes the inputs, outputs and intermediate steps involved in producing a result. It can be used to track errors, support reproducibility and improve understanding of data processing systems, and it has been studied in a variety of disciplines, including databases, digital libraries and scientific workflow systems.
Data provenance (sometimes written dataprovenance or data-provenance) is defined as: “the documented history of data derived from the electronic (computer-readable) records of successive processes or transformations applied to other electronic records.” Data provenance is concerned with tracking the inputs and outputs of computational processes (programs or scripts) that manipulate data. The provenance of digital data can be captured manually or automatically. Automated capture is generally less error-prone, but manual capture may be necessary when automated capture is not possible or practical.
One common format for representing data provenance is the PROV data model, which is part of the W3C PROV family of standards. Another is the Open Provenance Model (OPM).
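To make the PROV model concrete, the sketch below builds a toy PROV-style record in plain Python. All names (the `ex:` identifiers, the dictionary layout) are hypothetical; real PROV documents use the W3C PROV-N or PROV-JSON serialisations rather than this ad hoc structure.

```python
# Toy PROV-style record for one sequence: an entity (the data), an
# activity (the process that produced it) and an agent (who is responsible),
# plus the relations linking them. Identifiers are illustrative only.
provenance = {
    "entity":   {"id": "ex:genome-seq-1", "type": "prov:Entity"},
    "activity": {"id": "ex:assembly",     "type": "prov:Activity"},
    "agent":    {"id": "ex:lab-A",        "type": "prov:Agent"},
    "relations": [
        ("ex:genome-seq-1", "prov:wasGeneratedBy", "ex:assembly"),
        ("ex:genome-seq-1", "prov:wasAttributedTo", "ex:lab-A"),
    ],
}
```

The three node types and the wasGeneratedBy/wasAttributedTo relations mirror the core of the PROV data model; everything else about a provenance graph is built from such triples.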
In the context of SARS-CoV-2 genome sharing, data provenance allows scientists to track where each piece of genomic data came from, who created it, and when it was last modified. This helps ensure that data is accurately attributed and that any changes made are transparent and traceable. This study maps the high-level data provenance pipelines of two SARS-CoV-2 data portals, the Covid-19 Data Portal and GISAID.
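The kind of change tracking described above can be sketched as a minimal audit trail attached to a sequence record. The schema, the accession value and the actor names below are all hypothetical, chosen only to illustrate the who/what/when bookkeeping:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SequenceRecord:
    """Toy provenance-aware record for a genome sequence (hypothetical schema)."""
    accession: str
    submitter: str
    history: list = field(default_factory=list)  # (timestamp, actor, action)

    def log(self, actor: str, action: str) -> None:
        # Record who did what, and when, in UTC.
        self.history.append((datetime.now(timezone.utc).isoformat(), actor, action))

# Illustrative usage: a submission followed by a curation step.
rec = SequenceRecord(accession="EPI_ISL_0000000", submitter="lab-A")
rec.log("lab-A", "submitted")
rec.log("curator-B", "metadata corrected")
```

Every entry in `history` answers the three provenance questions at once: the timestamp (when), the actor (who) and the action (what changed).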
Methods
Two types of data are necessary in order to map the high-level data provenance pipeline for each data portal: (1) software and security information regarding the portals' infrastructure and (2) records of complete metadata for a given sequence. In order to obtain the relevant software and security information regarding each data portal, a penetration test (pen test) was conducted using the open-source WhatWeb command line tool. A penetration test is commonly used to find and exploit vulnerabilities in a computational system. While the aim here is not to find vulnerabilities in the data portals, the results from the pen test map out much of the infrastructure being used to host the data.
For each data portal the following command was used:
./whatweb -v https://www.dataportal-url.org -a 1
Where -v gives a verbose output of the results and
-a 1 sets the lowest (stealthy) aggression level, i.e. a
soft pen test. The results for each data portal can be
found in the following code chunks.
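Because the WhatWeb reports that follow share a fixed layout, the detected technologies can be pulled out of the `Summary :` line programmatically. The sketch below is tailored to the report format shown in this section; the mini-report used for illustration is abbreviated, not a real scan result.

```python
import re

def parse_summary(report: str) -> list:
    """Extract detected plugin names from a WhatWeb report's Summary line."""
    m = re.search(r"^Summary\s*:\s*(.+)$", report, re.MULTILINE)
    if not m:
        return []
    # Split on commas that are not inside [...] details,
    # then strip the bracketed detail from each plugin name.
    parts = re.split(r",\s*(?![^\[]*\])", m.group(1))
    return [re.sub(r"\[.*\]$", "", p).strip() for p in parts]

# Abbreviated, illustrative report in the same layout as the chunks below.
report = """Status : 200 OK
Summary : HTML5, Apache, X-Frame-Options[SAMEORIGIN]"""
print(parse_summary(report))
```

Collecting these plugin lists per portal gives a compact, comparable view of each portal's hosting stack.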
The Covid-19 Data Portal
WhatWeb report for https://www.covid19dataportal.org/
Status : 200 OK
Title : COVID-19 Data Portal - accelerating scientific research through data
IP : 193.62.193.83
Country : UNITED KINGDOM, GB
Summary : HTML5, UncommonHeaders[x-content-type-options,x-download-options,x-permitted-cross-domain-policies,referrer-policy,access-control-allow-origin], Apache, Google-Analytics[Universal][UA-163982534-1], HTTPServer[Apache], X-Frame-Options[SAMEORIGIN], Strict-Transport-Security[max-age=63072000; includeSubDomains; preload], Script[module], X-XSS-Protection[1; mode=block]
Detected Plugins:
[ Apache ]
The Apache HTTP Server Project is an effort to develop and
maintain an open-source HTTP server for modern operating
systems including UNIX and Windows NT. The goal of this
project is to provide a secure, efficient and extensible
server that provides HTTP services in sync with the current
HTTP standards.
Google Dorks: (3)
Website : http://httpd.apache.org/
[ Google-Analytics ]
This plugin identifies the Google Analytics account.
Version : Universal
Account : UA-163982534-1
Website : http://www.google.com/analytics/
[ HTML5 ]
HTML version 5, detected by the doctype declaration
[ HTTPServer ]
HTTP server header string. This plugin also attempts to
identify the operating system from the server header.
String : Apache (from server string)
[ Script ]
This plugin detects instances of script HTML elements and
returns the script language/type.
String : module
[ Strict-Transport-Security ]
Strict-Transport-Security is an HTTP header that restricts
a web browser from accessing a website without the security
of the HTTPS protocol.
String : max-age=63072000; includeSubDomains; preload
[ UncommonHeaders ]
Uncommon HTTP server headers. The blacklist includes all
the standard headers and many non standard but common ones.
Interesting but fairly common headers should have their own
plugins, eg. x-powered-by, server and x-aspnet-version.
Info about headers can be found at www.http-stats.com
String : x-content-type-options,x-download-options,x-permitted-cross-domain-policies,referrer-policy,access-control-allow-origin (from headers)
[ X-Frame-Options ]
This plugin retrieves the X-Frame-Options value from the
HTTP header. - More Info:
http://msdn.microsoft.com/en-us/library/cc288472%28VS.85%29.
aspx
String : SAMEORIGIN
[ X-XSS-Protection ]
This plugin retrieves the X-XSS-Protection value from the
HTTP header. - More Info:
http://msdn.microsoft.com/en-us/library/cc288472%28VS.85%29.
aspx
String : 1; mode=block
HTTP Headers:
HTTP/1.1 200 OK
Date: Mon, 31 Oct 2022 17:34:27 GMT
Server: Apache
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Download-Options: noopen
X-Permitted-Cross-Domain-Policies: none
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Referrer-Policy: same-origin
Access-Control-Allow-Origin: *
Last-Modified: Mon, 31 Oct 2022 08:09:31 GMT
ETag: "263a-5ec502092487f-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 2714
Connection: close
Content-Type: text/html; charset=UTF-8
GISAID
WhatWeb report for https://epicov.org/
Status : 301 Moved Permanently
Title : 301 Moved Permanently
IP : 104.22.50.160
Country : UNITED STATES, US
Summary : UncommonHeaders[cf-cache-status,cf-ray], HTTPServer[cloudflare], RedirectLocation[https://www.gisaid.org/]
Detected Plugins:
[ HTTPServer ]
HTTP server header string. This plugin also attempts to
identify the operating system from the server header.
String : cloudflare (from server string)
[ RedirectLocation ]
HTTP Server string location. used with http-status 301 and
302
String : https://www.gisaid.org/ (from location)
[ UncommonHeaders ]
Uncommon HTTP server headers. The blacklist includes all
the standard headers and many non standard but common ones.
Interesting but fairly common headers should have their own
plugins, eg. x-powered-by, server and x-aspnet-version.
Info about headers can be found at www.http-stats.com
String : cf-cache-status,cf-ray (from headers)
HTTP Headers:
HTTP/1.1 301 Moved Permanently
Date: Mon, 31 Oct 2022 17:39:48 GMT
Content-Type: text/html; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: close
Location: https://www.gisaid.org/
CF-Cache-Status: DYNAMIC
Server: cloudflare
CF-RAY: 762e2c16af9e188f-MAN
WhatWeb report for https://www.gisaid.org/
Status : 403 Forbidden
Title : 403 Forbidden
IP : 78.46.3.243
Country : GERMANY, DE
Summary : Apache, HTTPServer[Apache]
Detected Plugins:
[ Apache ]
The Apache HTTP Server Project is an effort to develop and
maintain an open-source HTTP server for modern operating
systems including UNIX and Windows NT. The goal of this
project is to provide a secure, efficient and extensible
server that provides HTTP services in sync with the current
HTTP standards.
Google Dorks: (3)
Website : http://httpd.apache.org/
[ HTTPServer ]
HTTP server header string. This plugin also attempts to
identify the operating system from the server header.
String : Apache (from server string)
HTTP Headers:
HTTP/1.1 403 Forbidden
Date: Mon, 31 Oct 2022 17:39:50 GMT
Server: Apache
Content-Length: 264
Connection: close
Content-Type: text/html; charset=iso-8859-1
In order to access data and metadata records, data and metadata were collected manually from each platform on October 30th 2022. To standardise this process, each manual collection began with the accession of the same sequence on each portal; the number of “hops” was then recorded to show the infrastructural lineage of the data, and the supplementary metadata associated with the sequence was documented. Data accessed through GISAID was handled in accordance with its terms of use policy.
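The “hop” records lend themselves to a very simple representation: an ordered list of the infrastructures a request passes through before the data is returned. The sketch below uses illustrative values (the hosts and roles shown are assumptions for the example, not the recorded measurements):

```python
# Ordered hops for one portal's data access path. Each hop names the
# infrastructure the request touches; values here are illustrative only.
hops = [
    {"step": 1, "host": "www.covid19dataportal.org", "role": "query interface"},
    {"step": 2, "host": "www.ebi.ac.uk/ena",         "role": "sequence and metadata store"},
]
hop_count = len(hops)
```

The hop count summarises the infrastructural lineage, while the per-hop entries preserve where the data actually resides along the path.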
Results and Discussion
GISAID Data Provenance Pipeline: A data user will most likely start their journey on the GISAID platform, hosted on highly secure, free and open-source software in Germany. From here the data user must sign up for the EpiCoV database, which is the primary way to access GISAID data. The EpiCoV database is hosted on highly secure, closed software in the United States. Aggregated data is summarised in nine columns that reference the three metadata sub-categories. One important thing to notice here is the direct mapping between the data and the metadata: in most cases, a column in the aggregated data directly references and matches a metadata sub-category that contains additional, and often more private, information. It should also be noted that there exists one isolated metadata category, which contains highly personal information regarding the submitter of the sequence.
The Covid-19 Data Portal Data Provenance Pipeline: A data user will most likely start their journey on the Covid-19 Data Portal, hosted on a fairly secure LAMP stack in the United Kingdom. From here the data user is able to use the portal to openly query sequence data. However, such a request does not come from the Covid-19 Data Portal's infrastructure itself; rather, it queries the European Nucleotide Archive (ENA) database and returns aggregated data. In order to explore the metadata associated with a sequence, the data user is then redirected to the ENA's app. An important thing to note here is the dissonance between the aggregated data labels and the associated metadata. While some common labels stay the same, e.g. accession ID, many take new names after being pooled to the Covid-19 Data Portal, e.g. center name (data) -> collection institution (metadata). This linkage strategy blurs the direct matches and references between the data and the five metadata sub-categories. That said, two of the metadata categories hosted by the ENA contain links if a sequence has been linked to a particular study, sample and taxon.
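The relabelling between the portal's aggregated data and the ENA metadata can be captured as a simple lookup table. Only the pairs actually mentioned above are included; the helper function name is ours, not part of either platform's API.

```python
# Mapping from Covid-19 Data Portal aggregated-data labels to their ENA
# metadata counterparts. Only pairs named in the text are included.
portal_to_ena = {
    "accession ID": "accession ID",           # label unchanged across the hop
    "center name": "collection institution",  # renamed after pooling to the portal
}

def to_ena_label(portal_label: str) -> str:
    """Resolve a portal column name to its ENA metadata field name.

    Labels with no known rename are returned unchanged.
    """
    return portal_to_ena.get(portal_label, portal_label)
```

Maintaining such a table explicitly is one way to undo the blurring the linkage strategy introduces, restoring a direct match between data columns and metadata fields.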